DAAA/FT/2A/01 2214296 Ng Wee Herng AIML CA2
Part B: Unsupervised Learning
You are running a shopping mall, and you have some data about your customers, such as Age, Gender, Income, and Spending.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
import plotly.graph_objects as go
from scipy.cluster.hierarchy import dendrogram
from scipy.stats import pearsonr
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_samples, silhouette_score
import warnings
data = pd.read_csv('./Customer_Dataset.csv')
data
| | CustomerID | Gender | Age | Income (k$) | How Much They Spend |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |
| ... | ... | ... | ... | ... | ... |
| 195 | 196 | Female | 35 | 120 | 79 |
| 196 | 197 | Female | 45 | 126 | 28 |
| 197 | 198 | Male | 32 | 126 | 74 |
| 198 | 199 | Male | 32 | 137 | 18 |
| 199 | 200 | Male | 30 | 137 | 83 |
200 rows × 5 columns
The dataset is collected from the shopping mall I am running (in this background context).
It contains 200 entries with 5 features:
- CustomerID: unique identifier for a customer, ranging from 1 to 200
- Gender: customer's gender (Male, Female)
- Age: customer's age
- Income (k$): customer's income in thousands of dollars
- How Much They Spend: spending score assigned to the customer
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   CustomerID           200 non-null    int64
 1   Gender               200 non-null    object
 2   Age                  200 non-null    int64
 3   Income (k$)          200 non-null    int64
 4   How Much They Spend  200 non-null    int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB
sns.set_style('darkgrid')
sns.set_palette('muted')
sns.countplot(x='Gender', data=data)
plt.title('Countplot of Gender')
plt.show()
We can see that there are more female than male customers at this shopping mall
vars=['Age', 'Income (k$)', 'How Much They Spend']
for var in vars:
sns.histplot(x=var, data=data, kde=True)
plt.title(f'Histogram of {var}')
plt.show()
From the histograms, we can see that Age and Income are slightly skewed to the right (their tails extend toward higher values), while the Spending Score appears approximately normally distributed
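The visual impression of skewness can be checked numerically with pandas' `skew()` (in the notebook this would simply be `data[vars].skew()`). A minimal sketch on synthetic stand-in columns, where a positive value indicates a right (positive) skew and a value near zero indicates a roughly symmetric distribution:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the real columns (illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'right_skewed': rng.exponential(scale=30, size=200),  # long right tail, like Income
    'symmetric': rng.normal(loc=50, scale=15, size=200),  # roughly like Spending Score
})

skews = df.skew()
print(skews)  # positive -> right-skewed, near zero -> symmetric
```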
vars=['Age', 'Income (k$)', 'How Much They Spend']
for var in vars:
sns.boxplot(x=var, data=data)
plt.title(f'Boxplot of {var}')
plt.show()
From the boxplots, we can see that the spread of each feature is reasonable, with only a few unusually high Income values flagged as potential outliers
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(21, 7))
sns.barplot(x='Gender', y='How Much They Spend', data=data, ci=None, ax=ax1)
ax1.set_title('Average Spending Score by Gender')
for i in ax1.containers:
    ax1.bar_label(i, fmt='%.1f')  # label each bar with its rounded value
sns.barplot(x='Gender', y='Income (k$)', data=data, ci=None, ax=ax2)
ax2.set_title('Average Income by Gender')
for i in ax2.containers:
    ax2.bar_label(i, fmt='%.1f')
sns.barplot(x='Gender', y='Age', data=data, ci=None, ax=ax3)
ax3.set_title('Average Age by Gender')
for i in ax3.containers:
    ax3.bar_label(i, fmt='%.1f')
plt.show()
From these barplots, we can see that the average spending score, income, and age are all fairly similar between the genders
correlation_matrix = data[['Age', 'Income (k$)', 'How Much They Spend']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
From this heatmap, we can see that the correlations between the features are all quite weak; the strongest is a mild negative correlation between Age and Spending Score
sns.lmplot(x='Age', y='Income (k$)', data=data, hue='Gender')
sns.lmplot(x='Age', y='How Much They Spend', data=data, hue='Gender')
sns.lmplot(x='Income (k$)', y='How Much They Spend', data=data, hue='Gender')
plt.show()
From these seaborn lmplots, we can see that gender does not really play a role, as the correlation patterns are very similar for males and females.
We will not be using Gender for our models
data = data.drop(['CustomerID'], axis=1)
data
| | Gender | Age | Income (k$) | How Much They Spend |
|---|---|---|---|---|
| 0 | Male | 19 | 15 | 39 |
| 1 | Male | 21 | 15 | 81 |
| 2 | Female | 20 | 16 | 6 |
| 3 | Female | 23 | 16 | 77 |
| 4 | Female | 31 | 17 | 40 |
| ... | ... | ... | ... | ... |
| 195 | Female | 35 | 120 | 79 |
| 196 | Female | 45 | 126 | 28 |
| 197 | Male | 32 | 126 | 74 |
| 198 | Male | 32 | 137 | 18 |
| 199 | Male | 30 | 137 | 83 |
200 rows × 4 columns
As the model does not understand categorical values, we will have to change the format of our features to numerical form. There are a few ways to do this.
As Gender is nominal, we will be using OneHotEncoder() from sklearn
data['Gender'].unique()
array(['Male', 'Female'], dtype=object)
Here, we use the OneHotEncoder() from sklearn with drop='if_binary', which drops the first column since Gender has only two categories.
enc = OneHotEncoder(drop = 'if_binary')
data[['Gender']] = enc.fit_transform(data[["Gender"]]).toarray()
data
| | Gender | Age | Income (k$) | How Much They Spend |
|---|---|---|---|---|
| 0 | 1.0 | 19 | 15 | 39 |
| 1 | 1.0 | 21 | 15 | 81 |
| 2 | 0.0 | 20 | 16 | 6 |
| 3 | 0.0 | 23 | 16 | 77 |
| 4 | 0.0 | 31 | 17 | 40 |
| ... | ... | ... | ... | ... |
| 195 | 0.0 | 35 | 120 | 79 |
| 196 | 0.0 | 45 | 126 | 28 |
| 197 | 1.0 | 32 | 126 | 74 |
| 198 | 1.0 | 32 | 137 | 18 |
| 199 | 1.0 | 30 | 137 | 83 |
200 rows × 4 columns
Now we can see that Gender is in numerical format, where:
Male → 1, Female → 0
There are two common feature scaling techniques:

Standardization: $z=\dfrac{x-u}{s}$,
where $u$ is the mean and $s$ is the standard deviation of the training samples

Normalization (min-max scaling): $X_{norm}=\dfrac{X-X_{min}}{X_{max}-X_{min}}$

We will be using Standardization with the help of StandardScaler() from sklearn
In clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures
(taken from slides in Brightspace)
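StandardScaler is nothing more than the z-score formula above applied per column. A minimal sketch on a toy array (the values are illustrative) confirming that the sklearn output matches the manual computation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy 2-column array standing in for (Age, Income)
X = np.array([[19., 15.], [21., 15.], [20., 16.], [23., 16.], [31., 17.]])

scaled = StandardScaler().fit_transform(X)
# Manual z-score: (x - mean) / std, using the population std (ddof=0) like sklearn
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))  # True
```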
num_col = ['Age', 'Income (k$)', 'How Much They Spend']
scale = StandardScaler()
data[num_col] = scale.fit_transform(data[num_col])
data
| | Gender | Age | Income (k$) | How Much They Spend |
|---|---|---|---|---|
| 0 | 1.0 | -1.424569 | -1.738999 | -0.434801 |
| 1 | 1.0 | -1.281035 | -1.738999 | 1.195704 |
| 2 | 0.0 | -1.352802 | -1.700830 | -1.715913 |
| 3 | 0.0 | -1.137502 | -1.700830 | 1.040418 |
| 4 | 0.0 | -0.563369 | -1.662660 | -0.395980 |
| ... | ... | ... | ... | ... |
| 195 | 0.0 | -0.276302 | 2.268791 | 1.118061 |
| 196 | 0.0 | 0.441365 | 2.497807 | -0.861839 |
| 197 | 1.0 | -0.491602 | 2.497807 | 0.923953 |
| 198 | 1.0 | -0.491602 | 2.917671 | -1.250054 |
| 199 | 1.0 | -0.635135 | 2.917671 | 1.273347 |
200 rows × 4 columns
As we can see, the numerical columns have been standardized
However, our dataset has only 4 features, so we will not be performing dimensionality reduction as the number of features is already quite small
K-means is a simple algorithm for clustering data, assigning each data point to one of the clusters by minimizing the WCSS (within cluster sum of squares)
Inertia is the sum of squared distance of samples to their closest cluster center.
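The definition above can be verified directly against sklearn's `inertia_` attribute: summing the squared distances of each sample to its assigned cluster centre reproduces the same number. A minimal sketch on synthetic blobs (the data is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

# Inertia = sum over all samples of squared distance to the assigned centroid
manual_inertia = sum(
    np.sum((X[km.labels_ == c] - km.cluster_centers_[c]) ** 2)
    for c in range(2)
)
print(np.isclose(manual_inertia, km.inertia_))  # True
```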
warnings.filterwarnings("ignore")
inertia = []
for k in range(2,11):
kmeans = KMeans(n_clusters=k, init='k-means++', random_state=1)
kmeans.fit(data[['Age', 'Income (k$)', 'How Much They Spend']])
inertia.append(kmeans.inertia_)  # inertia_ is already the WCSS; no transform needed
plt.plot(range(2, 11), inertia, marker="s")
plt.title("Inertia for each K")
plt.xlabel("$k$")
plt.ylabel("Inertia")
plt.show()
To find the number of clusters k, we look for the "elbow", the point where the curve starts to flatten.
However, the elbow is not very clear here, so we will move on to another method for determining k: comparing silhouette scores
The silhouette coefficient for a given sample is given as:
$s = \frac{b - a}{\max(a, b)}$,
where $a$ is the mean distance to the other points in the same cluster and $b$ is the mean distance to the points in the nearest other cluster
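The overall silhouette score used below is simply the mean of these per-sample coefficients, which can be checked with sklearn's `silhouette_samples`. A minimal sketch on synthetic blobs (the data is illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=150, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# The overall score is the mean of the per-sample silhouette coefficients
per_sample = silhouette_samples(X, labels)
print(np.isclose(per_sample.mean(), silhouette_score(X, labels)))  # True
```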
silhouette = []
for k in range(2, 11):
model = KMeans(n_clusters=k, init='k-means++', random_state=1).fit(data[['Age', 'Income (k$)', 'How Much They Spend']])
label = model.labels_
sil_coeff = silhouette_score(data[['Age', 'Income (k$)', 'How Much They Spend']], label, metric='euclidean')
silhouette.append(sil_coeff)
print(f"For n_clusters={k}, The Silhouette Coefficient is {round(sil_coeff, 3)}")
plt.plot(range(2, 11), silhouette, marker="s")
plt.title("Silhouette Score for each K")
plt.xlabel("$k$")
plt.ylabel("Silhouette Score")
plt.show()
For n_clusters=2, The Silhouette Coefficient is 0.335
For n_clusters=3, The Silhouette Coefficient is 0.358
For n_clusters=4, The Silhouette Coefficient is 0.404
For n_clusters=5, The Silhouette Coefficient is 0.417
For n_clusters=6, The Silhouette Coefficient is 0.427
For n_clusters=7, The Silhouette Coefficient is 0.418
For n_clusters=8, The Silhouette Coefficient is 0.408
For n_clusters=9, The Silhouette Coefficient is 0.419
For n_clusters=10, The Silhouette Coefficient is 0.4
From the silhouette plot, we can see that k=6 gives the highest silhouette score, and we will be using that for our clustering model
kmeans = KMeans(n_clusters=6, init='k-means++').fit(data[['Age', 'Income (k$)', 'How Much They Spend']])
kmeans_clustered = data[['Age', 'Income (k$)', 'How Much They Spend']].copy()
kmeans_clustered.loc[:,'Cluster'] = kmeans.labels_
Here we fit the K-Means model with the k-means++ initialization, which spreads the initial centroids out across the data; this typically speeds up convergence and improves the quality of the final clustering.
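The k-means++ seeding idea can be sketched in a few lines: the first centre is drawn uniformly at random, and each subsequent centre is sampled with probability proportional to its squared distance from the nearest centre chosen so far. A minimal illustrative implementation (not sklearn's internal one) on synthetic data:

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """Minimal sketch of k-means++ seeding (for illustration only)."""
    centers = [X[rng.integers(len(X))]]          # first centre: uniform at random
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen centre
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()                    # sample proportional to D^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.2, (40, 2)) for m in ((0, 0), (5, 0), (0, 5))])
print(kmeanspp_init(X, 3, rng).shape)  # (3, 2)
```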
km_sizes = kmeans_clustered.groupby('Cluster').size().to_frame()
km_sizes.columns = ["KM_size"]
km_sizes
| Cluster | KM_size |
|---|---|
| 0 | 39 |
| 1 | 39 |
| 2 | 45 |
| 3 | 33 |
| 4 | 21 |
| 5 | 23 |
kmeans_clustered
| | Age | Income (k$) | How Much They Spend | Cluster |
|---|---|---|---|---|
| 0 | -1.424569 | -1.738999 | -0.434801 | 5 |
| 1 | -1.281035 | -1.738999 | 1.195704 | 5 |
| 2 | -1.352802 | -1.700830 | -1.715913 | 4 |
| 3 | -1.137502 | -1.700830 | 1.040418 | 5 |
| 4 | -0.563369 | -1.662660 | -0.395980 | 4 |
| ... | ... | ... | ... | ... |
| 195 | -0.276302 | 2.268791 | 1.118061 | 0 |
| 196 | 0.441365 | 2.497807 | -0.861839 | 3 |
| 197 | -0.491602 | 2.497807 | 0.923953 | 0 |
| 198 | -0.491602 | 2.917671 | -1.250054 | 3 |
| 199 | -0.635135 | 2.917671 | 1.273347 | 0 |
200 rows × 4 columns
As there are more than 2 features, we will plot two 2D scatter plots, and then a 3D plot with the help of plotly for visualization
fig=plt.subplots(figsize=(10,8))
sns.scatterplot(x='Income (k$)', y='How Much They Spend', data=kmeans_clustered, hue='Cluster', palette='Set1', legend='full')
plt.title('Scatterplot on Clusters of Income vs Spending Score')
plt.show()
fig=plt.subplots(figsize=(10,8))
sns.scatterplot(x='Age', y='How Much They Spend', data=kmeans_clustered, hue='Cluster', palette='Set1', legend='full')
plt.title('Scatterplot on Clusters of Age vs Spending Score')
plt.show()
x = data[['Age','Income (k$)','How Much They Spend']].values
# 3d scatterplot using plotly
Scene = dict(xaxis = dict(title = 'Age'),yaxis = dict(title = 'Income (k$)'),zaxis = dict(title = 'How Much They Spend'))
# model.labels_ is nothing but the predicted clusters i.e y_clusters
labels = kmeans.labels_
trace = go.Scatter3d(x=x[:, 0], y=x[:, 1], z=x[:, 2], mode='markers',marker=dict(color = labels, size= 10, line=dict(color= 'black',width = 10)))
layout = go.Layout(margin=dict(l=0,r=0),scene = Scene,height = 800,width = 800)
data1 = [trace]
fig = go.Figure(data = data1, layout = layout)
fig.show()
sil_km = silhouette_score(data[['Age', 'Income (k$)', 'How Much They Spend']], kmeans.labels_)
print(sil_km)
0.4284167762892593
Agglomerative Clustering is a type of hierarchical clustering that builds clusters bottom-up: each point starts in its own cluster, and the closest pair of clusters is repeatedly merged.
To choose the number of k in this case, we will be plotting a dendrogram
def plot_dendrogram(model, **kwargs):
counts = np.zeros(model.children_.shape[0])
n_samples = len(model.labels_)
for i, merge in enumerate(model.children_):
current_count = 0
for child_idx in merge:
if child_idx < n_samples:
current_count += 1 # leaf node
else:
current_count += counts[child_idx - n_samples]
counts[i] = current_count
linkage_matrix = np.column_stack(
[model.children_, model.distances_, counts]
).astype(float)
dendrogram(linkage_matrix, **kwargs)
X = data[['Age', 'Income (k$)', 'How Much They Spend']]
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)
model = model.fit(X)
plt.title("Hierarchical Clustering Dendrogram")
plot_dendrogram(model, truncate_mode="level", p=3)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.hlines(7.5, 0, 300, colors="r", linestyle=":")
plt.show()
From this dendrogram, we cut the clusters at around the 7.5 mark, and counting the number of horizontal lines from that cut, there are 6 clusters.
We will fit the model with 6 clusters
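The "cut the dendrogram at a height" step can also be done programmatically with SciPy's `fcluster` using `criterion='distance'`, which is equivalent to drawing the horizontal line across the dendrogram. A minimal sketch on synthetic blobs (the data and the 5.0 threshold are illustrative, not the notebook's 7.5 cut):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three well-separated synthetic blobs
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.3, (30, 2)) for m in ((0, 0), (4, 0), (0, 4))])

Z = linkage(X, method='ward')
# criterion='distance' cuts the tree at height t, like the horizontal line on the dendrogram
labels = fcluster(Z, t=5.0, criterion='distance')
print(len(set(labels)))
```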
# Setting Number of Cluster to 6
agg_cluster_6 = AgglomerativeClustering(n_clusters=6).fit(X)
agg_clustered = data[['Age', 'Income (k$)', 'How Much They Spend']].copy()
agg_clustered.loc[:,'Cluster'] = agg_cluster_6.labels_
plt.figure(figsize=(10, 8))
sns.scatterplot(x='Age', y='How Much They Spend', data=agg_clustered, hue='Cluster', palette='Set1', legend='full')
plt.title('Scatterplot on Clusters of Age vs Spending Score')
plt.show()
plt.figure(figsize=(10, 8))
sns.scatterplot(x='Income (k$)', y='How Much They Spend', data=agg_clustered, hue='Cluster', palette='Set1', legend='full')
plt.title('Scatterplot on Clusters of Income vs Spending Score')
plt.show()
# 3d scatterplot using plotly
Scene = dict(xaxis = dict(title = 'Age'),yaxis = dict(title = 'Income (k$)'),zaxis = dict(title = 'How Much They Spend'))
# model.labels_ is nothing but the predicted clusters i.e y_clusters
labels = agg_cluster_6.labels_
trace = go.Scatter3d(x=x[:, 0], y=x[:, 1], z=x[:, 2], mode='markers',marker=dict(color = labels, size= 10, line=dict(color= 'black',width = 10)))
layout = go.Layout(margin=dict(l=0,r=0),scene = Scene,height = 800,width = 800)
data1 = [trace]
fig = go.Figure(data = data1, layout = layout)
fig.show()
sil_agg = silhouette_score(data[['Age', 'Income (k$)', 'How Much They Spend']], agg_cluster_6.labels_)
print(sil_agg)
0.4201169558789579
From the plots, we can see that K-Means and Agglomerative Clustering produce similar results, with K-Means having a slightly higher silhouette score
Spectral clustering is a technique with roots in graph theory, where the approach is used to identify communities of nodes in a graph based on the edges connecting them.
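Because it works on the connectivity graph rather than raw distances, spectral clustering can separate clusters that are not linearly separable, where K-Means fails. A minimal sketch on the classic two-moons toy dataset (illustrative; the parameters here are not tuned for the mall data):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: well connected by edges, but not linearly separable
X, y = make_moons(n_samples=200, noise=0.05, random_state=1)

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        n_neighbors=10, random_state=1).fit_predict(X)

# Spectral clustering recovers the moons far better than K-Means
print(adjusted_rand_score(y, km), adjusted_rand_score(y, sc))
```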
We will be using the silhouette plot again to find the optimal number of k
silhouette = []
for k in range(2, 11):
model = SpectralClustering(n_clusters=k, assign_labels="discretize", random_state=1).fit(data[['Age', 'Income (k$)', 'How Much They Spend']])
label = model.labels_
sil_coeff = silhouette_score(data[['Age', 'Income (k$)', 'How Much They Spend']], label, metric='euclidean')
silhouette.append(sil_coeff)
print(f"For n_clusters={k}, The Silhouette Coefficient is {round(sil_coeff, 3)}")
plt.plot(range(2, 11), silhouette, marker="s")
plt.title("Silhouette Score for each K")
plt.xlabel("$k$")
plt.ylabel("Silhouette Score")
plt.show()
For n_clusters=2, The Silhouette Coefficient is 0.334
For n_clusters=3, The Silhouette Coefficient is 0.354
For n_clusters=4, The Silhouette Coefficient is 0.404
For n_clusters=5, The Silhouette Coefficient is 0.38
For n_clusters=6, The Silhouette Coefficient is 0.429
For n_clusters=7, The Silhouette Coefficient is 0.422
For n_clusters=8, The Silhouette Coefficient is 0.394
For n_clusters=9, The Silhouette Coefficient is 0.386
For n_clusters=10, The Silhouette Coefficient is 0.408
From the silhouette plot, it seems that the optimal number of k is also 6, thus we will be implementing that into our model
spectral = SpectralClustering(n_clusters=6, assign_labels="discretize", random_state=1).fit(data[['Age', 'Income (k$)', 'How Much They Spend']])
spec_clustered = data[['Age', 'Income (k$)', 'How Much They Spend']].copy()
spec_clustered.loc[:,'Cluster'] = spectral.labels_
spec_clustered
| | Age | Income (k$) | How Much They Spend | Cluster |
|---|---|---|---|---|
| 0 | -1.424569 | -1.738999 | -0.434801 | 4 |
| 1 | -1.281035 | -1.738999 | 1.195704 | 4 |
| 2 | -1.352802 | -1.700830 | -1.715913 | 0 |
| 3 | -1.137502 | -1.700830 | 1.040418 | 4 |
| 4 | -0.563369 | -1.662660 | -0.395980 | 0 |
| ... | ... | ... | ... | ... |
| 195 | -0.276302 | 2.268791 | 1.118061 | 1 |
| 196 | 0.441365 | 2.497807 | -0.861839 | 3 |
| 197 | -0.491602 | 2.497807 | 0.923953 | 1 |
| 198 | -0.491602 | 2.917671 | -1.250054 | 3 |
| 199 | -0.635135 | 2.917671 | 1.273347 | 1 |
200 rows × 4 columns
plt.figure(figsize=(10, 8))
sns.scatterplot(x='Age', y='How Much They Spend', data=spec_clustered, hue='Cluster', palette='Set1', legend='full')
plt.title('Scatterplot on Clusters of Age vs Spending Score')
plt.show()
plt.figure(figsize=(10, 8))
sns.scatterplot(x='Income (k$)', y='How Much They Spend', data=spec_clustered, hue='Cluster', palette='Set1', legend='full')
plt.title('Scatterplot on Clusters of Income vs Spending Score')
plt.show()
# 3d scatterplot using plotly
Scene = dict(xaxis = dict(title = 'Age'),yaxis = dict(title = 'Income (k$)'),zaxis = dict(title = 'How Much They Spend'))
# model.labels_ is nothing but the predicted clusters i.e y_clusters
labels = spectral.labels_
trace = go.Scatter3d(x=x[:, 0], y=x[:, 1], z=x[:, 2], mode='markers',marker=dict(color = labels, size= 10, line=dict(color= 'black',width = 10)))
layout = go.Layout(margin=dict(l=0,r=0),scene = Scene,height = 800,width = 800)
data1 = [trace]
fig = go.Figure(data = data1, layout = layout)
fig.show()
sil_spec = silhouette_score(data[['Age', 'Income (k$)', 'How Much They Spend']], spectral.labels_)
print(sil_spec)
0.4293614515358638
We can see that Spectral Clustering is, again, very similar to K-Means and Agglomerative Clustering, both visually and in terms of silhouette score
DBSCAN groups 'densely grouped' data points into a single cluster. It can identify clusters in large spatial datasets by looking at the local density of the data points.
DBSCAN needs hyperparameter tuning with its 2 main parameters: eps (the neighbourhood radius) and min_samples (the minimum number of points required to form a dense region).
We will be using DBSCAN's default parameters first before hyperparameter tuning.
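A common heuristic for sanity-checking eps (not used in this notebook, shown here as a hedged aside) is the k-distance plot: sort every point's distance to its k-th nearest neighbour (k = min_samples) and look for the "elbow" in the curve. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Two dense synthetic blobs (illustrative data)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 0.3, (60, 2)) for m in ((0, 0), (4, 4))])

k = 5  # matches DBSCAN's default min_samples
# n_neighbors=k+1 because each point's nearest neighbour is itself (distance 0)
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])  # ascending k-th-NN distance per point

print(k_dist[:3], k_dist[-3:])  # plotting k_dist reveals the elbow near a good eps
```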
db = DBSCAN(n_jobs=-1).fit(data[['Age', 'Income (k$)', 'How Much They Spend']])
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print(silhouette_score(data[['Age', 'Income (k$)', 'How Much They Spend']], labels))
0.18451372756506046
The silhouette score is quite low compared to the other models, which may mean the model needs further tuning. First, we will plot the untuned DBSCAN to see how it performs
plt.figure(figsize=(10, 10))
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
if k == -1:
col = [0, 0, 0, 1]
class_member_mask = labels == k
xy = data[['Age', 'Income (k$)', 'How Much They Spend']][class_member_mask & core_samples_mask]
plt.plot(
xy.iloc[:, 0],
xy.iloc[:, 1],
"o",
markerfacecolor=tuple(col),
markeredgecolor="k",
markersize=14,
)
xy = data[['Age', 'Income (k$)', 'How Much They Spend']][class_member_mask & ~core_samples_mask]
plt.plot(
xy.iloc[:, 0],
xy.iloc[:, 1],
"o",
markerfacecolor=tuple(col),
markeredgecolor="k",
markersize=6,
)
plt.title("Estimated number of clusters: %d" % n_clusters_)
plt.show()
We see that DBSCAN actually did form 6 clusters, which aligns with the models implemented above. This could mean that the default epsilon value (0.5) is already reasonable for this data.
However, many points are coloured black, meaning they are treated as noise (outliers), so we will tune the min_samples parameter (default = 5) to try to reduce the number of noise points and improve the clustering
For hyperparameter tuning, since we are satisfied with the epsilon value (it produced a reasonable number of clusters), we will keep eps fixed and loop through different values of the min_samples parameter only
min_samples_grid = np.arange(3, 10)
scoreArr = []
for min_samples in min_samples_grid:
db = DBSCAN(eps=0.5,min_samples=min_samples,n_jobs=-1).fit(data[['Age', 'Income (k$)', 'How Much They Spend']])
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
scoreArr.append([str([min_samples]),silhouette_score(data[['Age', 'Income (k$)', 'How Much They Spend']], labels)])
scoreArr = pd.DataFrame(scoreArr).rename({0:"Params",1:"Silhouette Score"},axis=1).sort_values(['Silhouette Score'],ascending=False)
scoreArr.head(10)
| | Params | Silhouette Score |
|---|---|---|
| 4 | [7] | 0.223615 |
| 3 | [6] | 0.207328 |
| 2 | [5] | 0.184514 |
| 0 | [3] | 0.131576 |
| 5 | [8] | 0.120539 |
| 1 | [4] | 0.112211 |
| 6 | [9] | 0.109218 |
It seems that the optimal min_samples is 7, as the silhouette score increases up to 7 and starts to decrease after that
db = DBSCAN(eps=0.5, min_samples=7, n_jobs=-1).fit(data[['Age', 'Income (k$)', 'How Much They Spend']])
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
sil_db = silhouette_score(data[['Age', 'Income (k$)', 'How Much They Spend']], labels)
print(sil_db)
plt.figure(figsize=(10, 10))
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
if k == -1:
col = [0, 0, 0, 1]
class_member_mask = labels == k
xy = data[['Age', 'Income (k$)', 'How Much They Spend']][class_member_mask & core_samples_mask]
plt.plot(
xy.iloc[:, 0],
xy.iloc[:, 1],
"o",
markerfacecolor=tuple(col),
markeredgecolor="k",
markersize=14,
)
xy = data[['Age', 'Income (k$)', 'How Much They Spend']][class_member_mask & ~core_samples_mask]
plt.plot(
xy.iloc[:, 0],
xy.iloc[:, 1],
"o",
markerfacecolor=tuple(col),
markeredgecolor="k",
markersize=6,
)
plt.title("Estimated number of clusters: %d" % n_clusters_)
plt.show()
0.2236154132178232
After tuning, we see that DBSCAN has formed 4 clusters, and the silhouette score has improved from the untuned DBSCAN model
Similar to DBSCAN, we loop through possible values for each GMM parameter, like a grid search; however, instead of the silhouette score, we use the Bayesian Information Criterion (BIC) as the evaluation metric.
BIC estimates how well the GMM explains the data we actually have while penalizing model complexity; the lower the BIC, the better the model
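Concretely, $BIC = p\ln n - 2\ln\hat{L}$, where $p$ is the number of free parameters and $\hat{L}$ the maximized likelihood. This can be verified against sklearn's `gmm.bic()`: a minimal sketch on synthetic blobs (data and parameter counting are illustrative, for the 'full' covariance type):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=1)
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=1).fit(X)

# Free parameters for 'full' covariance:
# weights (k-1) + means (k*d) + covariances (k * d*(d+1)/2)
n, d = X.shape
k = gmm.n_components
p = (k - 1) + k * d + k * d * (d + 1) // 2

log_lik = gmm.score(X) * n          # score() is the mean per-sample log-likelihood
manual_bic = p * np.log(n) - 2 * log_lik

print(np.isclose(manual_bic, gmm.bic(X)))  # True
```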
n_components = range(1, 21)
covariance_type = ['spherical', 'tied', 'diag', 'full']
rows = []
for cov in covariance_type:
    for n_comp in n_components:
        gmm = GaussianMixture(n_components=n_comp, covariance_type=cov, random_state=3)
        gmm.fit(data[['Age', 'Income (k$)', 'How Much They Spend']])
        bic_score = gmm.bic(data[['Age', 'Income (k$)', 'How Much They Spend']])
        rows.append({'cov': cov, 'n_comp': n_comp, 'bic': bic_score})
# Build the DataFrame once at the end (DataFrame.append is deprecated)
score = pd.DataFrame(rows).sort_values(by='bic', ascending=True)
score.head(5)
| | cov | n_comp | bic |
|---|---|---|---|
| 45 | diag | 6 | 1535.368372 |
| 46 | diag | 7 | 1557.160249 |
| 7 | spherical | 8 | 1559.424944 |
| 6 | spherical | 7 | 1562.172506 |
| 9 | spherical | 10 | 1565.633567 |
Here we see that the parameters giving the lowest BIC are a diagonal covariance type and 6 components
gmm = GaussianMixture(n_components=6, covariance_type='diag', random_state=1).fit(data[['Age', 'Income (k$)', 'How Much They Spend']])
gmm_clustered = data[['Age', 'Income (k$)', 'How Much They Spend']].copy()
gmm_clustered.loc[:,'Cluster'] = gmm.predict(data[['Age', 'Income (k$)', 'How Much They Spend']])
gmm_clustered
| | Age | Income (k$) | How Much They Spend | Cluster |
|---|---|---|---|---|
| 0 | -1.424569 | -1.738999 | -0.434801 | 1 |
| 1 | -1.281035 | -1.738999 | 1.195704 | 0 |
| 2 | -1.352802 | -1.700830 | -1.715913 | 1 |
| 3 | -1.137502 | -1.700830 | 1.040418 | 0 |
| 4 | -0.563369 | -1.662660 | -0.395980 | 1 |
| ... | ... | ... | ... | ... |
| 195 | -0.276302 | 2.268791 | 1.118061 | 4 |
| 196 | 0.441365 | 2.497807 | -0.861839 | 2 |
| 197 | -0.491602 | 2.497807 | 0.923953 | 4 |
| 198 | -0.491602 | 2.917671 | -1.250054 | 2 |
| 199 | -0.635135 | 2.917671 | 1.273347 | 4 |
200 rows × 4 columns
plt.figure(figsize=(10, 8))
sns.scatterplot(x='Age', y='How Much They Spend', data=gmm_clustered, hue='Cluster', palette='Set1', legend='full')
plt.title('Scatterplot on Clusters of Age vs Spending Score')
plt.show()
plt.figure(figsize=(10, 8))
sns.scatterplot(x='Income (k$)', y='How Much They Spend', data=gmm_clustered, hue='Cluster', palette='Set1', legend='full')
plt.title('Scatterplot on Clusters of Income vs Spending Score')
plt.show()
# 3d scatterplot using plotly
Scene = dict(xaxis = dict(title = 'Age'),yaxis = dict(title = 'Income (k$)'),zaxis = dict(title = 'How Much They Spend'))
# model.labels_ is nothing but the predicted clusters i.e y_clusters
labels = gmm.predict(data[['Age', 'Income (k$)', 'How Much They Spend']])
trace = go.Scatter3d(x=x[:, 0], y=x[:, 1], z=x[:, 2], mode='markers',marker=dict(color = labels, size= 10, line=dict(color= 'black',width = 10)))
layout = go.Layout(margin=dict(l=0,r=0),scene = Scene,height = 800,width = 800)
data1 = [trace]
fig = go.Figure(data = data1, layout = layout)
fig.show()
sil_gmm = silhouette_score(data[['Age', 'Income (k$)', 'How Much They Spend']], gmm.predict(data[['Age', 'Income (k$)', 'How Much They Spend']]))
print(sil_gmm)
0.4088587422417899
After fitting all our models to the data, we will choose one model to interpret its clusters and find out characteristics about each cluster
print(f'K-Means: {sil_km}')
print(f'Agglomerative: {sil_agg}')
print(f'Spectral: {sil_spec}')
print(f'DBSCAN: {sil_db}')
print(f'Gaussian Mixture Model: {sil_gmm}')
K-Means: 0.4284167762892593
Agglomerative: 0.4201169558789579
Spectral: 0.4293614515358638
DBSCAN: 0.2236154132178232
Gaussian Mixture Model: 0.4088587422417899
Using the silhouette score as an evaluation metric across all models, we can see that Spectral Clustering scores the highest, followed very closely by K-Means and Agglomerative Clustering, while DBSCAN scores noticeably lower
Since Spectral Clustering achieved the highest silhouette score, and we found earlier that its clusters are very similar to those of K-Means, we will choose the Spectral Clustering model for interpretation
plt.figure(figsize=(10, 8))
sns.scatterplot(x='Age', y='How Much They Spend', data=spec_clustered, hue='Cluster', palette='Set1', legend='full')
plt.title('Scatterplot on Clusters of Age vs Spending Score')
plt.show()
plt.figure(figsize=(10, 8))
sns.scatterplot(x='Income (k$)', y='How Much They Spend', data=spec_clustered, hue='Cluster', palette='Set1', legend='full')
plt.title('Scatterplot on Clusters of Income vs Spending Score')
plt.show()
data_int = pd.read_csv('./Customer_Dataset.csv')
data_int = data_int[['Age', 'Income (k$)', 'How Much They Spend']]
data_int['Cluster'] = spec_clustered['Cluster']
data_int
| | Age | Income (k$) | How Much They Spend | Cluster |
|---|---|---|---|---|
| 0 | 19 | 15 | 39 | 4 |
| 1 | 21 | 15 | 81 | 4 |
| 2 | 20 | 16 | 6 | 0 |
| 3 | 23 | 16 | 77 | 4 |
| 4 | 31 | 17 | 40 | 0 |
| ... | ... | ... | ... | ... |
| 195 | 35 | 120 | 79 | 1 |
| 196 | 45 | 126 | 28 | 3 |
| 197 | 32 | 126 | 74 | 1 |
| 198 | 32 | 137 | 18 | 3 |
| 199 | 30 | 137 | 83 | 1 |
200 rows × 4 columns
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(30, 10))
sns.barplot(data=data_int, y=data_int.columns[0], x="Cluster", ax=ax1)
ax1.set_title(f"Average {data_int.columns[0]} based on clusters")
sns.barplot(data=data_int, y=data_int.columns[1], x="Cluster", ax=ax2)
ax2.set_title(f"Average {data_int.columns[1]} based on clusters")
sns.barplot(data=data_int, y=data_int.columns[2], x="Cluster", ax=ax3)
ax3.set_title(f"Average {data_int.columns[2]} based on clusters")
plt.show()
While interpreting the clusters, we will try to compare them with the types of customers identified in the background research
Doing this assignment, I learnt how to implement unsupervised learning through clustering, trying different models and interpreting the clusters in a given context.
However, it is important to note that there are not many features, which limits the information we can extract from these clusters.
Personally, I found it challenging at the start as I was very confused about how unsupervised learning works, but after some trial and error I understood it much better. I was also unsure how to implement hyperparameter tuning for unsupervised models, until I applied it to DBSCAN and the Gaussian Mixture Model by looping through parameter values to find what works best.